This study explores the evolution of the language of Disney Pictures movies over time through the computational lenses of topic modelling and sentiment analysis. We decided to focus on the spoken language of these motion pictures, which proved a complex problem to tackle, given the peculiarity of oral dialogue, especially in pictures aimed at children.
We focused from the start on a computational method which could give us insight into the peculiarities of each movie and, possibly, into parallels or similarities between different movies in different historical moments.
The reason behind the choice of our subject is a deep fascination for the imaginary worlds crafted by Disney Pixar and for the language used in the company’s motion pictures. Since these movies span almost a hundred years, ranging from classic to contemporary, it was also the perfect opportunity to study children-centric language from a diachronic point of view. The challenge mainly consisted in normalizing and treating such orally-bound and children-specific language in a way that would allow us to gather meaningful insight.
Since the object of our study was a collection of oral texts extracted from movies, we quickly identified many challenges, such as the oral nature of these texts and the widespread use of narrative expedients, such as flashbacks, that complicate computational processing. We needed an additional analysis that would take this specificity into account and thus offer a better understanding of the meaning and implications of the plot.
When we took up the project we chose to include in the study all motion pictures from Disney Pixar studios released up to that moment: 59 movies, created and distributed between 1937 and 2021, were therefore included.
After gathering and cleaning the textual data two main experiments were carried out:
1. We first used a tool called MALLET - a Java-based package for natural language processing - to extract n clusters of topics from each movie, which finally allowed us to group motion pictures similar in theme.
2. Concurrently, we experimented with the Syuzhet library for the programming language R to compare the differences - or lack thereof - in emotional valence among movies belonging to the same thematic cluster.
TODO: Conclusion (we found that…)
From their documentation:
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
This package was the starting point of our analysis, as it allowed us - interested humanists - to rise to the challenge of complex computational analyses without the steep learning curve of more complex tools. The CLI - command line interface, the kind of application or toolkit one uses via terminal commands - struck the perfect balance between control over the analysis and its output, and access to complex computations that would otherwise require programming knowledge in Java.
Additionally, the toolkit is open-source software, released under the Apache 2.0 License, and widely used in the field, which also means there is a large number of resources and solutions to common problems available online.
Syuzhet is one of the two terms describing narrative composition, along with the fabula, theorized by Russian Formalists Victor Shklovsky and Vladimir Propp. It refers to the “device” or technique of a narrative and is concerned with the manner in which the components of a story are organized. This is also the name chosen for an R-based package specifically targeted at natural language processing analyses. Its main goal is making NLP, and especially sentiment analysis of textual data, widely available in a simple and direct way.
After deciding on the time window of reference for the research, which spans from 1937 (the year Snow White and the Seven Dwarfs was released) to 2021 (the year this research first started), we needed to gather all relevant titles. The Wikipedia page for Disney movies felt like the perfect place to start. We downloaded the HTML page using the requests module for Python and subsequently parsed the document tree using beautifulsoup, an XML and HTML parsing library for Python.
from typing import List

from bs4 import BeautifulSoup, PageElement
import json
import requests

# Wikipedia page for "Disney Movies"
DISNEY_URI = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Animation_Studios_films"

# Retrieve the webpage in HTML format and save a local copy
response = requests.get(DISNEY_URI)
if response.status_code == 200:
    with open("disney_titles.html", "w") as fp:
        fp.write(response.text)

with open("disney_titles.html", "r") as fp:
    data = fp.read()

# Turn html string into Soup object
soup = BeautifulSoup(data, features="lxml")

# Retrieve all table rows containing movie titles
rows: List[PageElement] = soup.find_all("tr")

disney_movies = dict()
# Find first td element (title) and second td element (year) and aggregate in dict object;
# skip header rows, which contain no td cells
for row in rows:
    cells = row.find_all("td")
    if len(cells) >= 2:
        disney_movies[cells[0].text.strip("\n")] = {
            # replace non-breaking spaces with plain spaces
            "release_date": cells[1].text.strip("\n").replace("\u00a0", " ")
        }

# Save dict as disney_titles.json
with open("disney_titles.json", "w") as outfile:
    json.dump(disney_movies, outfile)
Once this step was over we had a JSON file mapping each movie title to its release year, in the format { "Snow White and the Seven Dwarfs": {"release_date": "1937"} }. To gather the subtitles for these movies we needed an open collection of subtitles, and we found that OpenSubtitles’ service fit our needs perfectly. They provide an open REST API, so after obtaining a key and getting comfortable with the documentation we quickly turned the JSON list of titles into a folder of .srt files. .srt files are very easy to work with, since they are written in plain text and their formatting is very predictable. Since at this stage we were working in Python, we cleaned the raw subtitles using a Python library called pysrt, which proved essential for extracting textual data from the .srt files. We also noticed that many texts were riddled with html tags, descriptions of surroundings, advertisements and so on, so while gathering the texts we also started cleaning them. This is one of the functions used to remove unwanted textual data from our subtitles:
import os
import re
from typing import Dict

import pysrt

def parse_subs() -> Dict[str, list]:
    """
    Turn subtitle files into an object:
    {"movie_name": ["YEAR", "text"], ...}
    """
    subs_directory = "subs/"
    final_object = {}
    for file in os.listdir(subs_directory):
        # Parse .srt file for easier handling
        try:
            srt = pysrt.open(subs_directory + file)
        except UnicodeDecodeError:
            print(f"Error handling file: {file}\nSkipping...")
            continue
        # Remove opensubtitles ads and intro:
        opensubs_ads = r'(♪)|(Advertise your product or brand here)|(contact www\.OpenSubtitles\.(org|com) today)|(Support us and become VIP member)|(to remove all ads from www\.OpenSubtitles\.(org|com))|(-== \[ www\.OpenSubtitles\.(org|com) \] ==-)|((((Subtitles by )|(Sync by ))(.+))$)|(font color="(.+)?")|(Provided by(.+)$)|(^(https?):\/\/[^\s\/$.?#].[^\s]*$)|(Please rate this subtitle at (.)+$)|(Help other users to choose the best subtitles)'
        remove_ads = re.sub(re.compile(opensubs_ads), "", srt.text)
        # Remove html tags, dashes (dialogues), returns
        remove_curly = re.sub(re.compile(r"\{.*?\}"), "", remove_ads)
        remove_html = re.sub(re.compile(r"((<[^>]+>)+)"), " ", remove_curly)
        remove_html_closing = re.sub(re.compile(r"((<\/[^>]+>)+)"), " ", remove_html)
        remove_dashes = re.sub(re.compile(r"-\s"), " ", remove_html_closing)
        remove_returns = re.sub(re.compile(r"[\r\t\n]"), " ", remove_dashes)
        remove_double_spaces = re.sub(re.compile(r"(\s+)"), " ", remove_returns)
        remove_starting_spaces = re.sub(re.compile(r"(^\s)"), "", remove_double_spaces)
        # Year and title are encoded in the file name as title_YEAR.srt
        year = file.split("_")[-1].strip(".srt")
        title = "_".join(file.split("_")[:-1])
        final_object[title] = [year, remove_starting_spaces]
    return final_object
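For reference, the .srt format consumed by the function above is plain text with a very predictable structure: numbered cues, each made of an index line, a timecode line, the cue text, and a blank line. A minimal standard-library sketch of extracting only the spoken text (separate from the pysrt-based pipeline; the sample cues below are invented for illustration):

```python
import re

# Invented sample of the .srt cue structure described above
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
Once upon a time...

2
00:00:04,000 --> 00:00:06,000
<i>Far, far away.</i>
"""

def extract_text(srt_string: str) -> str:
    """Concatenate the text of every cue, dropping indices and timestamps."""
    lines = []
    for block in srt_string.strip().split("\n\n"):  # one block per cue
        cue_lines = block.splitlines()[2:]          # skip index + timecode lines
        lines.extend(cue_lines)
    text = " ".join(lines)
    return re.sub(r"<[^>]+>", "", text)             # strip simple html tags

print(extract_text(SAMPLE_SRT))  # → Once upon a time... Far, far away.
```

This is what makes the format so predictable to clean: everything except the cue text sits at a fixed position within each block.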
Finally, we were done scraping and cleaning data. At this point the output of this first round was pickled (serialized with pickle, a Python-specific format) for future manipulation and saved as the first dataset.
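The pickling step mentioned above can be sketched as follows (the file name and the sample entry are illustrative, not the project’s actual ones):

```python
import pickle

# Illustrative stand-in for the cleaned {"title": ["year", "text"]} dataset
dataset = {"Snow_White_and_the_Seven_Dwarfs": ["1937", "Once upon a time..."]}

# Serialize to disk for future manipulation
with open("dataset_01.pickle", "wb") as fp:
    pickle.dump(dataset, fp)

# Deserialize later, getting back an identical Python object
with open("dataset_01.pickle", "rb") as fp:
    restored = pickle.load(fp)

print(restored == dataset)  # True
```

Unlike JSON, pickle preserves arbitrary Python objects unchanged, which is why it suits an intermediate dataset meant to be reloaded by later Python scripts.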
After the data had been scraped and put through a first cleaning stage, we made further adjustments in order to optimize Mallet’s tasks.
The input files for Mallet are copies of the web-scraped texts in which personal and organization names have been cleaned out using spacy’s en_core_web_sm model, since in previous trials with Mallet we found them to be noisy.
Further cleaning consisted in deploying nltk’s POS tagger to keep only those words labeled as NN (i.e., nouns) and longer than 4 characters; a regex was also used to remove words with apostrophes (e.g., “ya’ll”) that were missed by both spacy’s parser and nltk’s tagger.
The resulting files were saved into directory cartoonlp/nn_txts.
from nltk.tokenize import word_tokenize
import spacy
import os
import nltk
import re

NER = spacy.load("en_core_web_sm")
path = "nn_txts/"

for file in os.listdir("./txts"):
    with open("./txts/" + file, "r") as new_file:
        text = new_file.read()
    stripped_text = []
    parsed = NER(text)
    # Automatic detection of persons and organizations to remove
    for ent in parsed.ents:
        if ent.label_ == "PERSON" or ent.label_ == "ORG":
            text = text.replace(str(ent), "")
    tokens = word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    # Keep only those words recognised as nouns and longer than 4 characters
    for word, tag in tagged:
        if tag == 'NN' and len(word) > 4:
            stripped_text.append(word)
    # Remove words with apostrophes, such as contracted pronouns
    # (filtering into a new list avoids mutating the list while iterating it)
    stripped_text = [w for w in stripped_text if not re.search(r"\w+[']\w+?", w)]
    new_string = " ".join(stripped_text)
    with open(path + file, "w") as out_file:
        out_file.write(new_string)
The topic modelling was run from the shell with Mallet.
First we imported the directory of pre-processed files (nn_txts) into Mallet, removing any detected English stop words with the --remove-stopwords flag.
mallet import-dir \
--input sample-data/nn_txts \
--output disney_topics.mallet \
--keep-sequence --remove-stopwords
Then came an exploratory phase aimed at understanding which parameters were most appropriate for a corpus as small as ours, during which the import command was refined into the final form shown above. Here we iterated Mallet multiple times over the corpus, adjusting parameters as the cluster quality improved in our judgment.
We identified the following useful input parameters for train-topics:
–num-topics: the number of topics Mallet extracts
–optimize-burn-in: The number of iterations before hyper-parameter optimization begins. Default is twice the optimize interval.1
We started with a low number of topics (6) and increased it up to 15, at which point we decided the clusters were satisfactory: each cluster was intelligible and homogeneous, and its words made sense with the movies assigned to it.
The burn-in was also raised to 60, since we noticed it helped with topic homogenization.
mallet train-topics --input disney_topics.mallet \
--num-topics 15 \
--optimize-burn-in 60 \
--output-state disney-topic-state.gz \
--output-topic-keys disney_keys.txt \
--output-doc-topics disney_composition.csv \
--xml-topic-report disney_report.xml
## Year Title T1 T2
## 1 1937 Snow_White_and_the_Seven_Dwarfs.txt 0.0013831259 0.0677731674
## 2 1940 Fantasia_2000.txt 0.0977011494 0.1666666667
## 3 1940 Pinocchio.txt 0.0008012821 0.0248397436
## 4 1941 Dumbo.txt 0.0011037528 0.0242825607
## 5 1942 Bambi.txt 0.0024691358 0.0024691358
## 6 1942 Saludos_Amigos.txt 0.0051057622 0.0007293946
## T3 T4 T5 T6 T7 T8
## 1 0.0013831259 0.005532503 0.022130014 0.001383126 0.001383126 0.22544952
## 2 0.0028735632 0.002873563 0.002873563 0.011494253 0.002873563 0.03735632
## 3 0.0128205128 0.017628205 0.039262821 0.017628205 0.491185897 0.13782051
## 4 0.0309050773 0.011037528 0.037527594 0.044150110 0.001103753 0.24282561
## 5 0.0024691358 0.009876543 0.024691358 0.002469136 0.002469136 0.18765432
## 6 0.0007293946 0.020423049 0.029175784 0.016046681 0.026987600 0.03136397
## T9 T10 T11 T12 T13 T14
## 1 0.242047026 0.0013831259 0.0013831259 0.0013831259 0.14246196 0.183955740
## 2 0.002873563 0.0373563218 0.0632183908 0.1063218391 0.45977011 0.002873563
## 3 0.056089744 0.0008012821 0.0008012821 0.0008012821 0.14983974 0.048878205
## 4 0.020971302 0.4116997792 0.0275938190 0.0011037528 0.03421634 0.007726269
## 5 0.009876543 0.0098765432 0.0024691358 0.0246913580 0.50617284 0.017283951
## 6 0.048869438 0.4668125456 0.0335521517 0.1889132020 0.09919767 0.022611233
## T15
## 1 0.1009681881
## 2 0.0028735632
## 3 0.0008012821
## 4 0.1037527594
## 5 0.1950617284
## 6 0.0094821298
Movies in mallet_values.csv were chronologically ordered and plotted as a stacked bar chart.
First we want to look at the count of movies for each topic, to see how movies are distributed among topics. To do so we plotted a line chart for each topic. We noticed that every topic has a few movies in which its weight value is above 0.20, and selected this number as the minimum threshold for deciding which movie to include in which cluster.
\[insert screenshots of the charts\]
#Create columns for movies' release dates and titles
date <- paste(mallet_values$Year)
movie <- paste(mallet_values$Title)
#Create columns with binary values for each topic -- example here with T1
T1 <- mallet_values$T1
vT1 <- 0.20
Topic1 <- vector()
for (v in T1) {
if (v>= vT1) {
Topic1<- c(Topic1, 1)
} else {
Topic1<-c(Topic1,0)
}
}
print(Topic1)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
#create matrix
matrix<-data.frame(date,movie,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15)
matrix
## date movie Topic1 Topic2 Topic3 Topic4
## 1 1937 Snow_White_and_the_Seven_Dwarfs.txt 0 0 0 0
## 2 1940 Fantasia_2000.txt 0 0 0 0
## 3 1940 Pinocchio.txt 0 0 0 0
## 4 1941 Dumbo.txt 0 0 0 0
## 5 1942 Bambi.txt 0 0 0 0
## 6 1942 Saludos_Amigos.txt 0 0 0 0
## 7 1944 The_Three_Caballeros.txt 0 0 0 0
## 8 1946 Make_Mine_Music.txt 0 0 0 0
## 9 1947 Fun_and_Fancy_Free.txt 0 0 0 0
## 10 1949 The_Adventures_of_Ichabod_and_Mr._Toad.txt 0 0 0 0
## 11 1950 Cinderella.txt 0 0 0 0
## 12 1951 Alice_in_Wonderland.txt 0 0 0 0
## 13 1953 Peter_Pan.txt 0 0 1 0
## 14 1955 Lady_and_the_Tramp.txt 0 0 0 0
## 15 1958 Melody_Time.txt 0 0 0 0
## 16 1959 Sleeping_Beauty.txt 0 1 0 0
## 17 1961 One_Hundred_and_One_Dalmatians.txt 0 0 0 0
## 18 1963 The_Sword_in_the_Stone.txt 0 1 0 0
## 19 1967 The_Jungle_Book.txt 1 0 0 0
## 20 1970 The_Aristocats.txt 0 0 0 1
## 21 1973 Robin_Hood.txt 0 0 0 1
## 22 1977 The_Many_Adventures_of_Winnie_the_Pooh.txt 0 0 0 0
## 23 1977 The_Rescuers_Down_Under.txt 0 0 1 0
## 24 1981 The_Fox_and_the_Hound.txt 0 0 0 0
## 25 1985 The_Black_Cauldron.txt 0 0 0 0
## 26 1986 The_Great_Mouse_Detective.txt 0 0 0 0
## 27 1988 Oliver_&_Company.txt 0 0 0 1
## 28 1989 The_Little_Mermaid.txt 0 0 0 0
## 29 1990 The_Rescuers.txt 0 0 1 0
## 30 1991 Beauty_and_the_Beast.txt 0 0 0 0
## 31 1992 Aladdin.txt 0 0 0 1
## 32 1994 The_Lion_King.txt 0 0 0 0
## 33 1995 Pocahontas.txt 0 0 0 0
## 34 1996 The_Hunchback_of_Notre_Dame.txt 0 1 0 0
## 35 1997 Hercules.txt 0 0 0 0
## 36 1998 Mulan.txt 0 0 0 0
## 37 1999 Fantasia.txt 0 0 0 0
## 38 1999 Tarzan.txt 0 0 0 0
## 39 2000 Dinosaur.txt 0 0 0 0
## 40 2000 The_Emperor's_New_Groove.txt 0 0 1 0
## 41 2001 Atlantis_The_Lost_Empire.txt 0 0 0 0
## 42 2002 Treasure_Planet.txt 0 0 1 0
## 43 2003 Brother_Bear.txt 0 0 0 0
## 44 2004 Home_on_the_Range.txt 0 0 0 1
## 45 2005 Chicken_Little.txt 0 0 0 0
## 46 2007 Lilo_&_Stitch.txt 0 0 0 0
## 47 2007 Meet_the_Robinsons.txt 0 0 0 0
## 48 2008 Bolt.txt 0 0 0 0
## 49 2009 The_Princess_and_the_Frog.txt 0 0 0 0
## 50 2010 Tangled.txt 0 1 0 0
## 51 2011 Winnie_the_Pooh.txt 0 0 0 0
## 52 2012 Wreck-It_Ralph.txt 1 0 0 0
## 53 2013 Frozen_II.txt 0 0 0 0
## 54 2014 Big_Hero_6.txt 0 0 0 0
## 55 2016 Moana.txt 0 0 0 0
## 56 2016 Zootopia.txt 0 0 0 0
## 57 2018 Ralph_Breaks_the_Internet.txt 1 0 0 0
## 58 2019 Frozen.txt 0 0 0 0
## 59 2021 Raya_and_the_Last_Dragon.txt 0 0 0 0
## Topic5 Topic6 Topic7 Topic8 Topic9 Topic10 Topic11 Topic12 Topic13 Topic14
## 1 0 0 0 1 1 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 1 0
## 3 0 0 1 0 0 0 0 0 0 0
## 4 0 0 0 1 0 1 0 0 0 0
## 5 0 0 0 0 0 0 0 0 1 0
## 6 0 0 0 0 0 1 0 0 0 0
## 7 0 0 0 0 0 1 0 0 0 0
## 8 0 0 0 0 0 0 0 0 1 0
## 9 0 0 0 0 0 0 0 0 0 1
## 10 0 0 0 0 0 0 0 1 0 0
## 11 0 0 0 1 0 0 0 0 1 0
## 12 0 0 0 0 0 0 0 0 0 1
## 13 0 0 0 1 0 0 0 0 0 0
## 14 0 0 0 1 0 0 0 0 0 0
## 15 1 0 0 0 0 0 0 0 0 0
## 16 0 0 0 0 0 0 0 0 0 0
## 17 0 0 1 1 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0 0 0
## 19 0 0 0 1 0 0 0 0 0 0
## 20 0 0 0 1 0 0 0 0 0 0
## 21 0 0 0 0 0 0 0 0 0 0
## 22 0 1 0 1 0 0 0 0 0 0
## 23 0 0 0 0 0 0 0 0 0 0
## 24 0 0 0 1 0 0 0 0 0 0
## 25 1 0 0 0 0 0 0 0 0 1
## 26 0 1 0 0 0 0 0 0 0 0
## 27 0 0 0 0 0 0 0 0 0 0
## 28 0 0 0 0 0 0 0 0 0 0
## 29 0 0 0 1 0 0 0 0 0 0
## 30 0 0 0 0 0 0 0 0 0 1
## 31 0 0 0 0 0 0 0 0 0 0
## 32 0 0 0 1 1 0 0 0 0 0
## 33 1 0 0 0 1 0 0 0 0 0
## 34 0 0 0 0 0 0 0 0 0 0
## 35 1 0 0 0 0 0 0 0 1 0
## 36 1 0 0 0 0 0 0 0 0 0
## 37 0 0 0 0 0 0 0 0 1 0
## 38 1 0 0 1 1 0 0 0 0 0
## 39 0 1 0 0 1 0 0 0 0 0
## 40 0 0 0 0 0 0 0 0 0 0
## 41 0 0 0 0 0 0 0 1 0 0
## 42 0 0 0 0 0 0 0 0 0 0
## 43 0 0 0 0 1 0 0 0 0 0
## 44 0 0 0 0 0 0 0 0 0 0
## 45 0 0 0 0 0 0 1 0 0 0
## 46 0 0 0 1 0 1 0 0 0 0
## 47 0 0 0 0 0 0 1 0 0 0
## 48 1 0 0 0 0 0 0 0 0 0
## 49 0 0 0 0 0 0 0 0 0 0
## 50 0 0 0 0 0 0 0 0 0 0
## 51 0 1 0 0 0 0 0 0 0 0
## 52 0 0 0 0 0 0 0 0 0 0
## 53 1 0 0 0 0 0 0 0 0 0
## 54 0 0 0 0 0 0 1 0 0 0
## 55 0 0 0 0 1 0 0 0 0 0
## 56 0 0 1 0 0 0 0 0 0 0
## 57 0 0 0 0 0 0 0 0 0 0
## 58 1 0 0 0 0 0 0 0 0 0
## 59 0 0 0 0 0 0 0 0 0 0
## Topic15
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 0
## 19 0
## 20 0
## 21 0
## 22 0
## 23 0
## 24 0
## 25 0
## 26 0
## 27 0
## 28 0
## 29 0
## 30 0
## 31 1
## 32 0
## 33 0
## 34 0
## 35 0
## 36 0
## 37 0
## 38 0
## 39 0
## 40 1
## 41 0
## 42 0
## 43 0
## 44 0
## 45 0
## 46 0
## 47 0
## 48 0
## 49 1
## 50 0
## 51 0
## 52 0
## 53 1
## 54 0
## 55 0
## 56 0
## 57 0
## 58 0
## 59 1
movies <- c()
dates <-c()
t_weight<-c()
for (i in rownames(matrix)) {
title<- matrix[i, "movie"]
date<- matrix[i, "date"]
row <- mallet_values[match(title,mallet_values$Title)+1, ]
w <- row$T1
if (matrix[i, "Topic1"] ==1){movies<- c(movies, title)
dates<- c(dates, date)
t_weight<- c(t_weight, w)}
}
cluster_T1 <-data.frame(movies, dates, t_weight)
cluster_T1
## movies dates t_weight
## 1 The_Jungle_Book.txt 1967 0.05939716
## 2 Wreck-It_Ralph.txt 2012 0.03624901
## 3 Ralph_Breaks_the_Internet.txt 2018 0.02752976
thing friend medal jungle wheel video arcade stuff track racer glitch building today virus man-village buddy inurity march credit princess
## movies dates t_weight
## 1 Sleeping_Beauty.txt 1959 0.4325530
## 2 The_Sword_in_the_Stone.txt 1963 0.4678250
## 3 The_Hunchback_of_Notre_Dame.txt 1996 0.2336011
## 4 Tangled.txt 2010 0.5279748
dream world birthday child kingdom tower magic stone sword power witch crown blood flower story miracle tomorrow gleam light castle
## movies dates t_weight
## 1 Peter_Pan.txt 1953 0.4651852
## 2 The_Rescuers_Down_Under.txt 1977 0.3828829
## 3 The_Rescuers.txt 1990 0.4395712
## 4 The_Emperor's_New_Groove.txt 2000 0.3060960
## 5 Treasure_Planet.txt 2002 0.5076209
captain treasure diamond emperor pirate llama world order silver leader shadow flight singing house woman cyborg career chief shirt cliff
## movies dates t_weight
## 1 The_Aristocats.txt 1970 0.2641844
## 2 Robin_Hood.txt 1973 0.3610411
## 3 Oliver_&_Company.txt 1988 0.2983947
## 4 Aladdin.txt 1992 0.2201722
## 5 Home_on_the_Range.txt 2004 0.4520104
money street sheriff woman mouth kitty uncle range horse carpet reward partner trail alley sultan outta church minute property permission
## movies dates t_weight
## 1 Melody_Time.txt 1958 0.3809524
## 2 The_Black_Cauldron.txt 1985 0.4414414
## 3 Pocahontas.txt 1995 0.3390640
## 4 Hercules.txt 1997 0.2526455
## 5 Mulan.txt 1998 0.4562963
## 6 Tarzan.txt 1999 0.2733333
## 7 Bolt.txt 2008 0.3848712
## 8 Frozen_II.txt 2013 0.3672183
## 9 Frozen.txt 2019 0.4784226
thing heart family father chance river sister moment truth point daughter fault spirit death danger strength question choice sword place
## movies dates t_weight
## 1 The_Many_Adventures_of_Winnie_the_Pooh.txt 1977 0.5537998
## 2 The_Great_Mouse_Detective.txt 1986 0.4223541
## 3 Dinosaur.txt 2000 0.2229039
## 4 Winnie_the_Pooh.txt 2011 0.5779645
thing honey fellow goodness friend doctor house moment tummy narrator brain queen stuff mouse thought prize sense chapter bother bottle
## movies dates t_weight
## 1 Pinocchio.txt 1940 0.4911859
## 2 One_Hundred_and_One_Dalmatians.txt 1961 0.3540490
## 3 Zootopia.txt 2016 0.5849822
bunny savage world father conscience actor school plenty predator otter crime officer chain number couple whale traffic alert police system
## movies dates t_weight
## 1 Snow_White_and_the_Seven_Dwarfs.txt 1937 0.2254495
## 2 Dumbo.txt 1941 0.2428256
## 3 Cinderella.txt 1950 0.2100089
## 4 Peter_Pan.txt 1953 0.2318519
## 5 Lady_and_the_Tramp.txt 1955 0.5228938
## 6 One_Hundred_and_One_Dalmatians.txt 1961 0.4105461
## 7 The_Jungle_Book.txt 1967 0.3361742
## 8 The_Aristocats.txt 1970 0.2349291
## 9 The_Many_Adventures_of_Winnie_the_Pooh.txt 1977 0.2513168
## 10 The_Fox_and_the_Hound.txt 1981 0.3776758
## 11 The_Rescuers.txt 1990 0.2202729
## 12 The_Lion_King.txt 1994 0.3257230
## 13 Tarzan.txt 1999 0.2093333
## 14 Lilo_&_Stitch.txt 2007 0.2067968
place night mother minute morning friend thing matter trouble house surprise hurry business earth tonight creature goodness charge today devil
## movies dates t_weight
## 1 Snow_White_and_the_Seven_Dwarfs.txt 1937 0.2420470
## 2 The_Lion_King.txt 1994 0.2663623
## 3 Pocahontas.txt 1995 0.2244508
## 4 Tarzan.txt 1999 0.2813333
## 5 Dinosaur.txt 2000 0.3087935
## 6 Brother_Bear.txt 2003 0.4429224
## 7 Moana.txt 2016 0.6515152
heart water brother island village mountain world ocean voice monster share story earth stuff mission board chicken journey darkness ground
## movies dates t_weight
## 1 Dumbo.txt 1941 0.4116998
## 2 Saludos_Amigos.txt 1942 0.4668125
## 3 The_Three_Caballeros.txt 1944 0.5375218
## 4 Lilo_&_Stitch.txt 2007 0.3868402
gaucho plane circus angel motion elephant climax planet samba potato peanut shelter knife picture lilongo stand saddle roller stitch feather
## movies dates t_weight
## 1 Chicken_Little.txt 2005 0.3483299
## 2 Meet_the_Robinsons.txt 2007 0.6327313
## 3 Big_Hero_6.txt 2014 0.5737110
future today machine family science school buddy garage chance cover story class question invention baseball robot project problem companion control
## movies dates t_weight
## 1 The_Adventures_of_Ichabod_and_Mr._Toad.txt 1949 0.4727815
## 2 Atlantis_The_Lost_Empire.txt 2001 0.3826788
dream bridge grandfather crystal adventure power round excitement motorcar paper schoolmaster source price court country mania language flight police decision
## movies dates t_weight
## 1 Fantasia_2000.txt 1940 0.4597701
## 2 Bambi.txt 1942 0.5061728
## 3 Make_Mine_Music.txt 1946 0.5899796
## 4 Cinderella.txt 1950 0.4781055
## 5 Hercules.txt 1997 0.2189153
## 6 Fantasia.txt 1999 0.6557971
music heart story hurry dress spring number stuff dream window sound slipper beauty romance picture country matter tonight sweet glass
## movies dates t_weight
## 1 Fun_and_Fancy_Free.txt 1947 0.2018197
## 2 Alice_in_Wonderland.txt 1951 0.6504254
## 3 The_Black_Cauldron.txt 1985 0.2136422
## 4 Beauty_and_the_Beast.txt 1991 0.5483405
beast master castle monster father world watch party rabbit trouble afternoon child apple chance dinner pardon fault spell advice guest
## movies dates t_weight
## 1 Aladdin.txt 1992 0.2681427
## 2 The_Emperor's_New_Groove.txt 2000 0.2166018
## 3 The_Princess_and_the_Frog.txt 2009 0.4825248
## 4 Frozen_II.txt 2013 0.3435776
## 5 Raya_and_the_Last_Dragon.txt 2021 0.5064836
prince world princess magic water voice dragon future night palace forest today daughter sense problem light thing reason bayou restaurant
We saw that some clusters do not make much sense (topic 8, for instance), but the language is oral and some noise from Mallet is to be expected. Some clusters, however, were meaningful, and we observed the following:
Topic 2: the classic fairy-tale topic, present mostly around the 1960s. It returns prominently in 2010.
Sentiment: compare 1959 and 1963, since they are close in time, and then see what changes in 2010.
Topic 3: adventure; appears in 1953, 1990 and 2002.
Topic 5: 1997-1999, with two female and two male protagonists, whose weights seem correlated with the protagonist’s gender.
T8: we explain what it might represent, but without running Syuzhet on it: oral-bound language, exclamations, general cartoon-style language.
T9: wandering and the perception of nature; a sharp increase from 1999 to 2016, to be checked against its effect on sentiment.
T11: high tech; see how it is perceived in the dialogues and compare 2007 and 2014.
T13: the fairy-tale language of films whose soundtrack dominates over the text (Bambi, Make Mine Music and Cinderella); they are close in years, so we check whether there is a correlation within the cluster.
T14: the fantasy/magic world; compare 1951 and 1991.
T15: emancipated princesses: 2009, 2013, 2021.
First we import the Syuzhet package and read the csv file containing all the films with the tokenized sentences.
library(syuzhet) #enables the Syuzhet package
library(dplyr) #enables glimpse()
library(rmarkdown) #for pretty prints
df <- read.csv(url("https://raw.githubusercontent.com/fcagnola/cartoonlp/main/03_out_dataframe.csv"))
glimpse(df)
## Rows: 59
## Columns: 5
## $ X <chr> "Chicken_Little", "Frozen", "Bolt", "Fantasia_2000"…
## $ Year <int> 2005, 2013, 2008, 1999, 1949, 1940, 2012, 1942, 200…
## $ Text <chr> "Now, where to begin? How 'bout, ''Once upon a time…
## $ Sentence_Tokenized <chr> "['Now, where to begin?', \"How 'bout, ''Once upon …
## $ Tokenized <chr> "['now', 'begin', 'how', \"'bout\", '``', 'once', '…
The second step was readjusting a copy of the data frame to fit our purpose by:
selecting only those variables we are interested in for our analysis (i.e., “X”, “Year”, “Text”),
ordering the movies chronologically and readjusting the variables’ labels,
counting the length in words of each text.
texts_df<- df[, c("X", "Year", "Text")]
texts_df<- texts_df %>% rename(Title= X)
texts_df <- texts_df %>% arrange(Year)
for( i in rownames(texts_df) ){
string <- texts_df[i, "Text"]
count <- lengths(gregexpr("\\W+", string)) + 1
texts_df[i, "Length"] = count
}
An example of the final data frame is illustrated here.
Text processed during the scraping phase is retrieved from the texts_df data frame.
text1 = "Sleeping_Beauty"
row_1 <- texts_df[match(text1, texts_df$Title), ]
string_1 <- row_1$Text
text2 = "The_Sword_in_the_Stone"
row_2 <- texts_df[match(text2, texts_df$Title), ]
string_2 <- row_2$Text
text3 = "Tangled"
row_3 <- texts_df[match(text3, texts_df$Title), ]
string_3 <- row_3$Text
Calculating sentiment scores for the three texts using the Syuzhet library and its default method
v_1<- get_sentences(string_1)
v_2 <- get_sentences(string_2)
v_3 <- get_sentences(string_3)
sv_1<- get_sentiment(v_1, method="syuzhet")
sv_2<- get_sentiment(v_2, method="syuzhet")
sv_3<- get_sentiment(v_3, method="syuzhet")
Then we calculate a moving average for each vector of raw values, using a window size equal to 1/10 of the overall length of the vector.
#roll
wdw_1 <- round(length(sv_1)*.1)
rolled_1 <- zoo::rollmean(sv_1, k=wdw_1)
wdw_2 <- round(length(sv_2)*.1)
rolled_2 <- zoo::rollmean(sv_2, k=wdw_2)
wdw_3 <- round(length(sv_3)*.1)
rolled_3 <- zoo::rollmean(sv_3, k=wdw_3)
list_1 <- rescale_x_2(rolled_1)
list_2 <- rescale_x_2(rolled_2)
list_3 <- rescale_x_2(rolled_3)
sample_1 <- seq(1, length(list_1$x), by=round(length(list_1$x)/100))
sample_2 <- seq(1, length(list_2$x), by=round(length(list_2$x)/100))
sample_3 <- seq(1, length(list_3$x), by=round(length(list_3$x)/100))
#normalization for comparison
x1 <- 1:length(sv_1)
y1 <- sv_1
raw_1 <- loess(y1 ~ x1, span=.5)
line1 <- rescale(predict(raw_1))
x2 <- 1:length(sv_2)
y2 <- sv_2
raw_2 <- loess(y2 ~ x2, span=.5)
line2 <- rescale(predict(raw_2))
x3 <- 1:length(sv_3)
y3 <- sv_3
raw_3 <- loess(y3 ~ x3, span=.5)
line3 <- rescale(predict(raw_3))
sample_1 <- seq(1, length(line1), by=round(length(line1)/100))
sample_2 <- seq(1, length(line2), by=round(length(line2)/100))
sample_3 <- seq(1, length(line3), by=round(length(line3)/100))
plot(line1[sample_1],
type="l",
col="blue",
xlab="Narrative Time (sampled)",
ylab="Emotional Valence"
)
lines(line2[sample_2], col="orange")
lines(line3[sample_3], col="green")
legend(75, 1, legend=c(text1, text2, text3),
col=c("blue", "orange", "green"), lty=1:1, cex=0.5,
title="Movies", text.font=4, bg='white')
Interesting points: Tangled is interesting because it begins like the first film and ends with an arc very similar to the second.
1959 and 1963 because they are close in time; then see what changes in 2010.
The beginnings are fairly similar, then Peter Pan and Treasure Planet become opposites. The films made before 2000 are more agitated, with more oscillating curves.
We note that films with male protagonists tend to have similar curves and much more positive endings than their female counterparts (because things always have to turn out better for the boys), while the female-led ones characteristically start positive within the first fifth of the plot and end with a value at least 0.5 lower than the starting point.
Films produced after the start of the 2000s have endings tending towards a negative sentiment value, whereas at the end of the 1990s endings still remained positive. The most recent film has a more “complex” curve than the others. With the exception of Dinosaur they all seem to start on a very positive note, although (Moana aside) Brother Bear and Tarzan also drop very quickly below zero, keeping negative values until the end of the film.
This comparison is interesting because the 2007 film has a neutral beginning and drops immediately, while the 2014 film starts negative but rises quickly and keeps a higher value on average. The two lines nonetheless remain very similar, and one can infer a positive connotation in the cartoons’ dialogues that could be indicative of the cultural spirit of the early 2000s, which held a largely optimistic view of technological innovation.
Even taking into account the large disparity in the weight of spoken text versus music (Cinderella has a lot of text, the others mostly music), Bambi and Cinderella are much more similar to each other than to Make Mine Music, because the latter is a compilation of 10 short films: its sentiment curve cannot be considered representative of a plot, but rather of the editing done by the producers. This makes it more deliberate in terms of emotional value, since it does not have to follow a plot arc imposed by a fairy tale, as happens for instance in Cinderella, which is subject to the constraints of its nature as a traditional fairy tale. It is exemplary of the weight of the producers’ artistic choices.
Although they start with (even very) different values, halfway through the film both reach their peak of positivity, and from that moment on they tend to follow a similar, descending trajectory.
All the films have very similar values at both the beginning and the end; the two most recent ones show a comparable trajectory, with positive peaks towards the middle of the film preceded and followed by negative peaks, while the oldest one is much more linear: after a descent with a negative peak at the midpoint, it simply rises until the end.
We extracted valuable insights for some of these clusters, while remaining aware of the limitations of our data-science expertise and of the limited scope of our study.
Nonetheless, interesting considerations emerged from the intersection of the collected data with our historical, social and cultural knowledge, which allowed us to draw conclusions on the trajectories of films with male versus female protagonists and on the evolution of customs and society, as in the discussion of the future and new technologies.
We hope this initial, exploratory study may serve as a starting point for future, deeper investigations of a field, the language of children’s films, which in our view is under-explored yet important: it plays a part in the development and education of young human beings, while at the same time acting as a litmus test for the social trends of its era.